Font group identification using reconstructed fonts

Authors

  • Michael Patrick Cutter
  • Joost van Beusekom
  • Faisal Shafait
  • Thomas M. Breuel
Abstract

Ideally, digital versions of scanned documents should be represented in a format that is searchable, compressed, highly readable, and faithful to the original. In theory, these goals can be achieved by applying OCR and font recognition and then re-typesetting the document text in the original fonts. However, OCR and font recognition remain hard problems, and many historical documents use fonts that are not available in digital form. It is therefore desirable to reconstruct fonts with vector glyphs that approximate the shapes of the letters that form a font. In this work, we address the grouping of tokens in a token-compressed document into candidate fonts. This permits us to incorporate font information into token-compressed images even when the original fonts are unknown or unavailable in digital format. This paper extends previous work in font reconstruction by proposing and evaluating an algorithm that assigns a font to every character within a document, a necessary step toward representing a scanned document image with a reconstructed font. In our evaluation, we measured 98.4% accuracy for the assignment of letters to candidate fonts in multi-font documents.

Similar resources

Automatic Bilingual Legacy-Fonts Identification and Conversion System

Digital text written in an Indian script is difficult to use as such. This is because there are a number of font formats available for typing, and these font formats are not mutually compatible. Gurmukhi alone has more than 225 popular ASCII-based fonts, whereas this figure is 180 in the case of Devanagari. To read the text written in a particular font, that font is required to be installed on ...

Font and Function Word Identification in Document Recognition

An algorithm is presented that identifies the predominant font in which the running text in an English language document is printed. Frequent function words (such as the, of, and, a, and to) are also recognized as part of th...

font would be used during recognition. This would reduce the confusion caused by training on many fonts and would effectively reduce the recognition problem to choosing the

Font clustering and cluster identification in document images

In this work, the problem of clustering and recognizing fonts in document images is addressed. Various font features and their clustering behavior are investigated. Font clustering is implemented from both the shape-similarity and the OCR-performance points of view. A font recognition algorithm is developed that can identify the font group or the individual font from which a text was created. © 2001 SPI...

Generating Type 1 Fonts from METAFONT Sources

Nowadays most printers demand PostScript files with scalable fonts instead of bitmapped fonts, as the latter are not adequate for most cases. In addition, PDF files generated from PostScript files with embedded bitmapped fonts are poorly rendered on a computer screen. On the other hand, PostScript files generated from TeX sources have traditionally contained bitmapped fonts just because METAFONT gen...

Arabic Font Recognition using Decision Trees Built from Common Words

We present an algorithm for a priori Arabic optical font recognition (AFR). The basic idea is to recognize the fonts of some common Arabic words. Once these fonts are known, they can be generalized to lines, paragraphs, or neighboring non-common words, since these components of a text almost always share the same font. Our approach uses a decision tree to recognize Arabic fonts. A set of 48 features i...

Publication date: 2011